Machine learning continues to grow in importance for many organizations across nearly all domains. Examples include:
In essence, these tasks all seek to learn from data. To address each scenario, we use a given set of features to train an algorithm and extract insights. These algorithms, or learners, can be classified according to the amount and type of supervision provided during training. The two main groups this book focuses on are supervised learners, which are used to construct predictive models, and unsupervised learners, which are used to build descriptive models. Which type you will need to use depends on the learning task you hope to accomplish.
A predictive model is used for tasks that involve the prediction of a given output using other variables and their values (features) in the data set. Or as stated by @apm, predictive modeling is “the process of developing a mathematical tool or model that generates an accurate prediction” (p. 2). The learning algorithm in a predictive model attempts to discover and model the relationship among the target response (the variable being predicted) and the other features (aka predictor variables). Examples of predictive modeling include:
Each of these examples has a defined learning task: each uses attributes (\(X\)) to predict an outcome measurement (\(Y\)).
Throughout this text I will use various terms interchangeably for:
- $X$: "predictor variables", "independent variables", "attributes", "features", "predictors"
- $Y$: "target variable", "dependent variable", "response", "outcome measurement"
The predictive modeling examples above describe what is known as supervised learning. The supervision refers to the fact that the target values provide a supervisory role, which indicates to the learner the task it needs to learn. Specifically, given a set of data, the learning algorithm attempts to optimize a function (the algorithmic steps) to find the combination of feature values that results in a predicted value that is as close to the actual target output as possible.
Supervised learning problems revolve around two primary themes: regression and classification.
In supervised learning, the training data you feed the algorithm includes the desired solutions. Consequently, the solutions can be used to help _supervise_ the training process to find the optimal algorithm parameters.
When the objective of our supervised learning is to predict a numeric outcome, we refer to this as a regression problem (not to be confused with linear regression modeling). Regression problems revolve around predicting output that falls on a continuous numeric spectrum. In the examples above, predicting home sales prices and time to market are regression problems because the output is numeric and continuous. This means that, given the combination of predictor values, the response value could fall anywhere along the continuous spectrum. Figure @ref(fig:regression-problem) illustrates average home sales prices as a function of two home features: year built and total square footage. Depending on the combination of these two features, the expected home sales price could fall anywhere along the plane.
When the objective of our supervised learning is to predict a categorical response, we refer to this as a classification problem. Classification problems most commonly revolve around predicting a binary or multinomial response measure such as:
However, when we apply machine learning models to classification problems, rather than predict a particular class (e.g., “yes” or “no”), we often predict the probability of each class (e.g., yes: .65, no: .35). The class with the highest probability then becomes the predicted class. Consequently, even though we are solving a classification problem, we are still predicting a numeric output (a probability). The essence of the problem, however, still makes it a classification problem.
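To make the probability-to-class step concrete, here is a minimal sketch. It assumes a logistic regression fit with base R's `glm` on the built-in `mtcars` data (not one of this book's data sets), and the 0.5 cutoff is simply what "highest probability" means when there are two classes.

```r
# Illustration only: mtcars is a stand-in data set, not one used in this book.
# Response: transmission type (am: 0 = automatic, 1 = manual)
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# predicted probability of class "1" for each observation
prob_yes <- predict(fit, newdata = mtcars, type = "response")

# with two classes, the class with the highest probability is the one
# whose probability exceeds 0.5
pred_class <- ifelse(prob_yes > 0.5, 1, 0)

# probabilities next to the class they imply
head(data.frame(prob_yes = round(prob_yes, 2), pred_class))
```

The model's raw output is numeric, but the final answer is still a class label, which is why the task remains a classification problem.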
Although there are machine learning algorithms that can be applied to regression problems but not classification and vice versa, the supervised learning algorithms I cover in this book can be applied to both. These algorithms have become the most popular machine learning applications in recent years.
Although the chapters that follow will go into detail on each algorithm, the following provides a quick reference guide that compares and contrasts some of their features. Moreover, I provide recommended base learner packages that I have found to scale well with typical rectangular data analyzed by organizations.
| Characteristics | Regularized GLM | Random Forest | Gradient Boosting Machine | Deep Learning |
|---|---|---|---|---|
| Allows n < p | | | | |
| Provides automatic feature selection | | | | |
| Handles missing values | | | | |
| No feature pre-processing required | | | | |
| Robust to outliers | | | | |
| Easy to tune | | | | |
| Computational speed | | | | |
| Predictive power | | | | |
| Preferred regression base learner | glmnet, h2o.glm | ranger, h2o.randomForest | xgboost, h2o.gbm | keras, h2o.deeplearning |
| Preferred classification base learner | glmnet, h2o.glm | ranger, h2o.randomForest | xgboost, h2o.gbm | keras, h2o.deeplearning |
Unsupervised learning, in contrast to supervised learning, includes a set of statistical tools to better understand and describe your data but performs the analysis without a target variable. In essence, unsupervised learning is concerned with identifying groups in a data set. The groups may be defined by the rows (i.e., clustering) or the columns (i.e., dimension reduction); however, the motive in each case is quite different.
The goal of clustering is to segment observations into similar groups based on the observed variables. For example, we might divide consumers into different homogeneous groups, a process known as market segmentation. In dimension reduction, we are often concerned with reducing the number of variables in a data set. For example, classical regression models break down in the presence of highly correlated features. Dimension reduction techniques provide a method to reduce the feature set to a potentially smaller set of uncorrelated variables. These variables are often used as the input variables to downstream supervised models.
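The two ideas can be sketched with base R alone. This example is an illustration, not one of the book's case studies: it assumes the built-in `USArrests` data, uses `kmeans` for clustering rows and `prcomp` for reducing columns, and the choices of three clusters and two components are arbitrary.

```r
# USArrests ships with base R; standardize so no single variable dominates
x <- scale(USArrests)

# clustering: group the rows (states) into 3 segments (3 is an arbitrary choice)
set.seed(123)
km <- kmeans(x, centers = 3, nstart = 20)
table(km$cluster)

# dimension reduction: replace the 4 correlated columns with principal components
pca <- prcomp(x)
scores <- pca$x[, 1:2]   # keep the first two components

# the retained components are uncorrelated with each other
round(cor(scores), 3)
```

The `scores` matrix is the kind of reduced, uncorrelated feature set that could then be passed to a supervised model.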
Unsupervised learning is often performed as part of an exploratory data analysis. However, the exercise tends to be more subjective, and there is no simple goal for the analysis, such as prediction of a response. Furthermore, it can be hard to assess the quality of results obtained from unsupervised learning methods. The reason for this is simple. If we fit a predictive model using a supervised learning technique (e.g., linear regression), then it is possible to check our work by seeing how well our model predicts the response Y on observations not used in fitting the model. However, in unsupervised learning, there is no way to check our work because we don’t know the true answer; the problem is unsupervised.
However, the importance of unsupervised learning should not be overlooked and techniques for unsupervised learning are used in organizations to:
These questions, and many more, can be addressed with unsupervised learning. Moreover, often the results of an unsupervised model can be used as inputs to downstream supervised learning models.
In his seminal 2001 paper, Leo Breiman popularized the phrase: “the multiplicity of good models.” The phrase means that for the same set of input variables and prediction targets, complex machine learning algorithms can produce multiple accurate models with very similar, but not the exact same, internal architectures.
Figure @ref(fig:error-surface) is a depiction of a non-convex error surface that is representative of the error function for a machine learning algorithm with two inputs — say, a customer’s income and a customer’s age, and an output, such as the same customer’s probability of redeeming a coupon. This non-convex error surface with no obvious global minimum implies there are many different ways complex machine learning algorithms could learn to weigh a customer’s income and age to make a good decision about if they are likely to redeem a coupon. Each of these different weightings would create a different function for making coupon redemption (and therefore marketing) decisions, and each of these different functions would have different explanations.
All of this is an obstacle to data scientists. On one hand, different models can have widely different predictions based on the same feature set. Even models built from the same algorithm but with different hyperparameters can lead to different results. Consequently, practitioners should understand how different implementations of an algorithm differ, since these differences can cause variance in their results (e.g., a default xgboost model can produce very different results from a default gbm model, even though both implement gradient boosting machines).
Alternatively, data scientists can see very similar predictions from different models based on the same feature set. However, these models will have very different logic and structure, leading to different interpretations. Consequently, practitioners should understand how to interpret different types of models.
This book provides you with a fundamental understanding to compare and contrast models and even package implementations of similar algorithms. Several machine learning interpretability techniques will be demonstrated to help you understand what is driving model and prediction performance. This will allow you to be more effective and efficient in applying and understanding multiple good models.
The XX data sets chosen for this book allow us to illustrate the different features of our machine learning algorithms. Since the goal of this book is to demonstrate how to implement R’s ML stack, I make the assumption that you have already spent significant time cleaning and getting to know your data via exploratory data analysis. This would allow you to perform many necessary tasks prior to the ML tasks outlined in this book such as:
Consequently, the exemplar data sets I use throughout this book have, for the most part, gone through the necessary cleaning process. These data sets are all freely available and include:
AmesHousing package [@R-ames]; see ?AmesHousing::ames_raw
# access data
ames <- AmesHousing::make_ames()
# initial dimension
dim(ames)
[1] 2930 81
# response variable
head(ames$Sale_Price)
[1] 215000 105000 172000 244000 189900 195500
You can see the entire data cleaning process to transform the raw Ames housing data (`AmesHousing::ames_raw`) to the final clean data (`AmesHousing::make_ames`) that we will use in machine learning algorithms throughout this book at:
https://github.com/topepo/AmesHousing/blob/master/R/make_ames.R
Attrition (i.e. “Yes”, “No”); rsample package [@R-rsample]; see ?rsample::attrition
# access data
attrition <- rsample::attrition
# initial dimension
dim(attrition)
[1] 1470 31
# response variable
head(attrition$Attrition)
[1] Yes No Yes No No No
Levels: No Yes
V785 (i.e. numbers to predict: 0, 1, …, 9)
# load training data
# https://h2o-public-test-data.s3.amazonaws.com/bigdata/laptop/mnist/train.csv.gz
train <- data.table::fread("C:/Users/kojikm.mizumura/Desktop/Data Science/Hands-on ML/data/mnist_train.csv", data.table = FALSE)
# load test data
# https://h2o-public-test-data.s3.amazonaws.com/bigdata/laptop/mnist/test.csv.gz
test <- data.table::fread("C:/Users/kojikm.mizumura/Desktop/Data Science/Hands-on ML/data/mnist_test.csv", data.table = FALSE)
# initial dimension
dim(train)
# response variable
head(train$V785)
TODO: get unsupervised data sets for clustering and dimension reduction examples
Machine learning is a very iterative process. If performed and interpreted correctly, we can have great confidence in our outcomes. If not, the results will be useless. Approaching machine learning correctly means approaching it strategically by spending our data wisely on learning and validation procedures, properly pre-processing variables, minimizing data leakage, tuning hyperparameters, and assessing model performance. Before introducing specific algorithms, this chapter introduces concepts that are commonly required in the supervised machine learning process and that you’ll see briskly covered in each chapter.
library(rsample)
library(caret)
library(h2o)
library(dplyr)
# turn off progress bars
h2o.no_progress()
# launch h2o
h2o.init()
Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 26 minutes 18 seconds
H2O cluster timezone: Asia/Tokyo
H2O data parsing timezone: UTC
H2O cluster version: 3.20.0.8
H2O cluster version age: 1 month and 15 days
H2O cluster name: H2O_started_from_R_KojiKM.Mizumura_oly465
H2O cluster total nodes: 1
H2O cluster total memory: 1.96 GB
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: Algos, AutoML, Core V3, Core V4
R Version: R version 3.5.0 (2018-04-23)
To illustrate some of the concepts, we will use the Ames Housing data and employee attrition data introduced in Chapter @ref(intro). Throughout this book, I’ll demonstrate approaches with regular data frames. However, since many of the supervised machine learning chapters leverage the h2o package, we’ll also show how to do some of the tasks with H2O objects. This requires your data to be in an H2O object, and you can easily convert any data frame to one with as.h2o.
If you try to convert the original `rsample::attrition` data set to an H2O object an error will occur. This is because several variables are _ordered factors_ and H2O has no way of handling this data type. Consequently, you must convert any ordered factors to unordered.
# ames data
library(tidyverse)
ames <- AmesHousing::make_ames()
ames.h2o <- as.h2o(ames)
# attrition data
churn <- rsample::attrition %>%
  mutate_if(is.ordered, factor, ordered = FALSE)
churn.h2o <- as.h2o(churn)
A major goal of the machine learning process is to find an algorithm \(f(x)\) that most accurately predicts future values (\(y\)) based on a set of inputs (\(x\)). In other words, we want an algorithm that not only fits well to our past data, but more importantly, one that predicts a future outcome accurately. This is called the generalizability of our algorithm. How we “spend” our data will help us understand how well our algorithm generalizes to unseen data.
To provide an accurate understanding of the generalizability of our final optimal model, we split our data into training and test data sets:
Given a fixed amount of data, typical recommendations for splitting your data into training-testing splits include 60% (training) - 40% (testing), 70%-30%, or 80%-20%. Generally speaking, these are appropriate guidelines to follow; however, it is good to keep in mind that as your overall data set gets smaller,
In today’s data-rich environment, typically, we are not lacking in the quantity of observations, so a 70-30 split is often sufficient. The two most common ways of splitting data include simple random sampling and stratified sampling.
The simplest way to split the data into training and test sets is to take a simple random sample. This does not control for any data attributes, such as the percentage of data represented in your response variable (\(y\)). There are multiple ways to split our data. Here we show four options to produce a 70-30 split (note that setting the seed value allows you to reproduce your randomized splits):
knitr::include_graphics("images/data_split.png")
# baseR
set.seed(123)
index_1 <- sample(1:nrow(ames), round(nrow(ames)*0.7))
train_1 <- ames[index_1,]
test_1 <- ames[-index_1,]
dim(train_1)
[1] 2051 81
# caret package
set.seed(123)
library(caret)
library(AmesHousing)
index_2 <- createDataPartition(ames$Sale_Price, p=0.7, list=FALSE)
train_2 <- ames[index_2,]
test_2 <- ames[-index_2,]
# rsample package
library(rsample)
set.seed(123)  # seed the split so it is reproducible, as in the other options
split_1 <- initial_split(ames, prop = 0.7)
train_3 <- training(split_1)
test_3 <- testing(split_1)
# h2o package
library(h2o)
split_2 <- h2o.splitFrame(ames.h2o, ratios = 0.7, seed = 123)
train_4 <- split_2[[1]]
test_4 <- split_2[[2]]
Since this sampling approach randomly samples across the distribution of \(y\) (Sale_Price in our example), you will typically end up with a similar distribution in your training and test sets, as illustrated below.
However, if we want to explicitly control our sampling so that our training and test sets have similar \(y\) distributions, we can use stratified sampling. This is more common with classification problems where the response variable may be imbalanced (90% of observations with response “Yes” and 10% with response “No”).
However, we can also apply stratified sampling to regression problems for data sets that have a small sample size and where the response variable deviates strongly from normality. With a continuous response variable, stratified sampling will break \(y\) down into quartiles and randomly sample from each quartile. Consequently, this will help ensure a balanced representation of the response distribution in both training and test sets.
The easiest way to perform stratified sampling on a response variable is to use the rsample package, where you specify the response variable to stratify on. The following illustrates that in our original employee attrition data we have an imbalanced response (No: 84%, Yes: 16%). By enforcing stratified sampling, both our training and testing sets have approximately equal response distributions.
# original response distribution
table(churn$Attrition) %>% prop.table()
No Yes
0.8387755 0.1612245
# stratified sampling with rsample package
set.seed(123)
split_strata <- initial_split(churn, prop=0.7, strata="Attrition")
train_strat <- training(split_strata)
test_strat <- testing(split_strata)
# consistent response ratio between train & test
table(train_strat$Attrition) %>% prop.table()
No Yes
0.838835 0.161165
table(test_strat$Attrition) %>% prop.table()
No Yes
0.8386364 0.1613636
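The quartile-based stratification described above for a continuous response can also be sketched by hand in base R. This is only an illustration of the idea, not how rsample is implemented: the skewed response `y` is simulated, and the helper logic (binning with `cut`, then sampling 70% within each bin) is an assumption made for the example.

```r
# Hypothetical right-skewed response; simulated, not from the book's data sets
set.seed(123)
y <- rlnorm(200, meanlog = 12, sdlog = 0.4)

# assign each observation to a quartile bin of y
bins <- cut(y, breaks = quantile(y, probs = seq(0, 1, 0.25)), include.lowest = TRUE)

# sample 70% of the row indices within each quartile bin
train_idx <- unlist(lapply(split(seq_along(y), bins), function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}))

train_y <- y[train_idx]
test_y <- y[-train_idx]

# each quartile contributes the same share to the training set
table(bins[train_idx])
```

Because every quartile contributes the same proportion of rows, the training and test responses cover the low, middle, and high ends of the distribution in similar amounts.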
Feature engineering generally refers to the process of adding, deleting, and transforming the variables to be applied to your machine learning algorithms. Feature engineering is a significant process and requires you to spend substantial time understanding your data… or as Leo Breiman said, “live with your data before you plunge into modeling.”
Although this book primarily focuses on applying machine learning algorithms, feature engineering can make or break an algorithm’s predictive ability. We will not cover all the potential ways of implementing feature engineering; however, we will cover a few fundamental pre-processing tasks that can significantly improve modeling performance.
To learn more about feature engineering check out Feature Engineering for Machine Learning by @zheng2018feature and Max Kuhn’s upcoming book Feature Engineering and Selection: A Practical Approach for Predictive Models.
Although not a requirement, normalizing the distribution of the response variable by using a transformation can lead to a big improvement, especially for parametric models. As we saw in the data splitting section, our response variable Sale_Price is right skewed.
ggplot(train_1, aes(x = Sale_Price)) +
  geom_density(trim = TRUE) +
  geom_density(data = test_1, trim = TRUE, col = "red")
To normalize, we have a few options:
Option 1: normalize with a log transformation. This will transform most right skewed distributions to be approximately normal.
# log transformation
train_log_y <- log(train_1$Sale_Price)
test_log_y <- log(test_1$Sale_Price)
library(dplyr)
train_log_y %>% as_tibble() %>% dim()
[1] 2051 1
Sampling is a random process so setting the random number generator with a common seed allows for reproducible results. Throughout this book I will use the number _123_ often for reproducibility but the number itself has no special meaning.
If your response has negative values, then a log transformation will produce NaNs. If these negative values are small (between -0.99 and 0), then you can apply log1p, which adds 1 to the value prior to applying a log transformation. If your data contain values less than or equal to -1, use the Yeo-Johnson transformation instead.
train_log_y %>%
  as_tibble() %>%
  rename(train_log = value) %>%
  ggplot(aes(train_log)) +
  geom_density()
Option 2: use a Box Cox transformation. A Box Cox transformation is more flexible: it searches a family of power transforms for the one that brings the variable as close as possible to a normal distribution.
log(-.5)
[1] NaN
log1p(-.5)
[1] -0.6931472
# Box Cox transformation
library(forecast)
lambda <- forecast::BoxCox.lambda(train_1$Sale_Price)
train_bc_y <- forecast::BoxCox(train_1$Sale_Price, lambda)
test_bc_y <- forecast::BoxCox(test_1$Sale_Price, lambda)
We can see that in this example, the log transformation and Box Cox both do about equally well in transforming our response variable to be normally distributed.
Note that when you model with a transformed response variable, your predictions will also be on the transformed scale. You will likely want to re-transform your predicted values to their original scale so that decision-makers can interpret the results. The following code can do this for you:
# log transform a value
y <- log(10)
# re-transforming the log-transformed value
exp(y)
[1] 10
# Box Cox transform a value
y <- forecast::BoxCox(10,lambda)
# Inverse box cox function
inv_box_cox <- function(x,lambda){
if(lambda==0) exp(x)
else (lambda*x+1)^(1/lambda)
}
# re-transforming the box cox-transformed values
inv_box_cox(y, lambda)
[1] 10
attr(,"lambda")
[1] -0.3067918
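For reference, the power-transform family behind a Box Cox transformation can be written out directly. The sketch below is a hand-rolled illustration for strictly positive values, not the forecast package's implementation: `box_cox` is a hypothetical helper name, the lambda value is picked arbitrarily, and the inverse is repeated from the code above only so the block runs on its own.

```r
# Box Cox forward transform for strictly positive y (hypothetical helper)
box_cox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

# inverse transform, same form as the inv_box_cox function shown above
inv_box_cox <- function(x, lambda) {
  if (lambda == 0) exp(x) else (lambda * x + 1)^(1 / lambda)
}

y <- c(1, 10, 100000)   # illustrative positive values
lambda <- -0.3          # arbitrary lambda chosen for the example

z <- box_cox(y, lambda)

# the round trip recovers the original values (up to floating point error)
all.equal(inv_box_cox(z, lambda), y)

# lambda = 0 reduces to the plain log transformation
all.equal(box_cox(y, 0), log(y))
```

Writing the pair side by side makes it easy to see that the inverse simply solves the forward formula for the original value.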
library(dplyr)
data.frame(
Normal = train_1$Sale_Price,
Log_Transform = train_log_y,
BoxCox_Transform = train_bc_y
) %>%
gather(Transform, Value) %>%
mutate(Transform = factor(Transform, levels = c("Normal", "Log_Transform", "BoxCox_Transform"))) %>%
ggplot(aes(Value, fill = Transform)) +
geom_histogram(show.legend = FALSE, bins = 40) +
facet_wrap(~ Transform, scales = "free_x")
Many models require all predictor variables to be numeric. Consequently, we need to transform any categorical variables into numeric representations so that these algorithms can compute. Some packages automate this process (e.g., h2o, glm, caret), while others do not (e.g., glmnet, keras). Furthermore, there are many ways to encode categorical variables as numeric representations (e.g., one-hot, ordinal, binary, sum, Helmert).
The most common is referred to as one-hot encoding, where we transpose our categorical variables so that each level of the feature is represented as a boolean value. For example, one-hot encoding variable x in the following:
set.seed(123)
ex1 <- data.frame(id=1:8, x=sample(letters[1:3],8, replace=TRUE))
knitr::kable(ex1)
| id | x |
|---|---|
| 1 | a |
| 2 | c |
| 3 | b |
| 4 | c |
| 5 | c |
| 6 | a |
| 7 | b |
| 8 | c |
results in the following representation:
If your response has negative values, you can use the Yeo-Johnson transformation. To apply, use `car::powerTransform` to identify the lambda, `car::yjPower` to apply the transformation, and `VGAM::yeo.johnson` to apply the transformation and/or the inverse transformation.
This is called less than full rank encoding, where we retain all variables for each level of x. However, this creates perfect collinearity, which causes problems with some machine learning algorithms (e.g., generalized regression models, neural networks). Alternatively, we can create full-rank one-hot encoding by dropping one of the levels (level a has been dropped):
| id | x.b | x.c |
|---|---|---|
| 1 | 0 | 0 |
| 2 | 0 | 1 |
| 3 | 1 | 0 |
| 4 | 0 | 1 |
| 5 | 0 | 1 |
| 6 | 0 | 0 |
| 7 | 1 | 0 |
| 8 | 0 | 1 |
If you need to implement one-hot encoding manually, you can do so with caret::dummyVars. Sometimes a feature level has very few observations, and all of them end up in the test set but not the training set. The benefit of fitting dummyVars on the full data set and then applying the result to both the train and test data sets is that it guarantees the same features are represented in both.
# full rank one-hot encode - recommended for generalized linear models and
# neural networks
full_rank <- dummyVars( ~ ., data = ames, fullRank = TRUE)
train_oh <- predict(full_rank, train_1)
test_oh <- predict(full_rank, test_1)
# less than full rank --> dummy encoding
dummy <- dummyVars( ~ ., data = ames, fullRank = FALSE)
train_oh <- predict(dummy, train_1)
test_oh <- predict(dummy, test_1)
Two things to note:
- Since one-hot encoding adds new features, it can significantly increase the dimensionality of our data. If you have a data set with many categorical variables, and those categorical variables in turn have many unique levels, the number of features can explode. In these cases, you may want to explore ordinal encoding of your data.
- When using h2o you do not need to explicitly encode your categorical variables, but you can override the default encoding. This can be considered a tuning parameter, as some encoding approaches will improve modeling accuracy over others. See the encoding options in the H2O documentation: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/welcome.html
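As a base R alternative to the dummyVars calls above, the same two encodings can be sketched with `model.matrix`. This is an illustration under assumptions: the small data frame `ex` re-creates the eight-row example by hand, and removing the intercept column afterwards is just one way to get the full-rank version.

```r
# hand-built copy of the small example (id and a three-level factor x)
ex <- data.frame(
  id = 1:8,
  x = factor(c("a", "c", "b", "c", "c", "a", "b", "c"))
)

# less than full rank: one indicator column per level (no intercept)
less_than_full <- model.matrix(~ x - 1, data = ex)

# full rank: the first level ("a") is dropped; remove the intercept column
full_rank <- model.matrix(~ x, data = ex)[, -1, drop = FALSE]

colnames(less_than_full)   # "xa" "xb" "xc"
colnames(full_rank)        # "xb" "xc"
```

In the less than full rank version every row sums to one, which is exactly the perfect collinearity that the full-rank version avoids.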
Some models (K-NN, SVMs, PLS, neural networks) require that the predictor variables have the same units. Centering and scaling can be used for this purpose, and together they are often referred to as standardizing the features. Standardizing numeric variables results in zero mean and unit variance, which provides a common comparable unit of measure across all the variables.
Some packages have built-in arguments to standardize (e.g., h2o, caret) and some do not (e.g., glm, keras). If you need to manually standardize your variables, you can use the preProcess function provided by the caret package. For example, here we center and scale our Ames predictor variables.
one_hot <- dummyVars(~., ex1, fullRank=F)
ex2 <- predict(one_hot, ex1)
knitr::kable(ex2)
| id | x.a | x.b | x.c |
|---|---|---|---|
| 1 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 |
| 3 | 0 | 1 | 0 |
| 4 | 0 | 0 | 1 |
| 5 | 0 | 0 | 1 |
| 6 | 1 | 0 | 0 |
| 7 | 0 | 1 | 0 |
| 8 | 0 | 0 | 1 |
It is important that you standardize the test data based on the training mean and variance values of each feature. This minimizes data leakage.
There are some alternative transformations that you can perform:
# identify only the predictor variables
features <- setdiff(names(train_1), "Sale_Price")
features
[1] "MS_SubClass" "MS_Zoning" "Lot_Frontage" "Lot_Area"
[5] "Street" "Alley" "Lot_Shape" "Land_Contour"
[9] "Utilities" "Lot_Config" "Land_Slope" "Neighborhood"
[13] "Condition_1" "Condition_2" "Bldg_Type" "House_Style"
[17] "Overall_Qual" "Overall_Cond" "Year_Built" "Year_Remod_Add"
[21] "Roof_Style" "Roof_Matl" "Exterior_1st" "Exterior_2nd"
[25] "Mas_Vnr_Type" "Mas_Vnr_Area" "Exter_Qual" "Exter_Cond"
[29] "Foundation" "Bsmt_Qual" "Bsmt_Cond" "Bsmt_Exposure"
[33] "BsmtFin_Type_1" "BsmtFin_SF_1" "BsmtFin_Type_2" "BsmtFin_SF_2"
[37] "Bsmt_Unf_SF" "Total_Bsmt_SF" "Heating" "Heating_QC"
[41] "Central_Air" "Electrical" "First_Flr_SF" "Second_Flr_SF"
[45] "Low_Qual_Fin_SF" "Gr_Liv_Area" "Bsmt_Full_Bath" "Bsmt_Half_Bath"
[49] "Full_Bath" "Half_Bath" "Bedroom_AbvGr" "Kitchen_AbvGr"
[53] "Kitchen_Qual" "TotRms_AbvGrd" "Functional" "Fireplaces"
[57] "Fireplace_Qu" "Garage_Type" "Garage_Finish" "Garage_Cars"
[61] "Garage_Area" "Garage_Qual" "Garage_Cond" "Paved_Drive"
[65] "Wood_Deck_SF" "Open_Porch_SF" "Enclosed_Porch" "Three_season_porch"
[69] "Screen_Porch" "Pool_Area" "Pool_QC" "Fence"
[73] "Misc_Feature" "Misc_Val" "Mo_Sold" "Year_Sold"
[77] "Sale_Type" "Sale_Condition" "Longitude" "Latitude"
# pre-process estimation based on training features
pre_process <- preProcess(
x=train_1[,features],
method=c("center","scale")
)
# apply to both training & test
train_x <- predict(pre_process, train_1[,features])
test_x <- predict(pre_process, test_1[,features])
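The principle behind the preProcess call above (estimate on the training data, apply to both sets) can be sketched in base R. This is an illustration rather than what caret does internally: it assumes the built-in `mtcars` data and three arbitrarily chosen numeric columns.

```r
# illustration on built-in data; the columns are an arbitrary choice
set.seed(123)
idx <- sample(nrow(mtcars), round(0.7 * nrow(mtcars)))
train <- mtcars[idx, c("mpg", "hp", "wt")]
test <- mtcars[-idx, c("mpg", "hp", "wt")]

# estimate the centering and scaling values on the training data only
mu <- colMeans(train)
sigma <- apply(train, 2, sd)

# apply the training values to both sets, so nothing leaks from the test data
train_std <- scale(train, center = mu, scale = sigma)
test_std <- scale(test, center = mu, scale = sigma)

round(colMeans(train_std), 3)   # zero mean on the training set
round(colMeans(test_std), 3)    # close to, but not exactly, zero on the test set
```

The test set is not expected to come out with exactly zero mean and unit variance; that small mismatch is the price of keeping test information out of the pre-processing step.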
For example, the following normalizes predictors with a Box Cox transformation, centers and scales continuous variables, performs principal component analysis to reduce the predictor dimensions, and removes predictors with near zero variance.
# identify only the predictor variables
features <- setdiff(names(train_1), "Sale_Price")
# pre-process estimation based on training features
pre_process <- preProcess(
x = train_1[, features],
method = c("BoxCox", "center", "scale", "pca", "nzv")
)
# apply to both training & test
train_x <- predict(pre_process, train_1[, features])
test_x <- predict(pre_process, test_1[, features])
There are many packages to perform machine learning, and there is almost always more than one package for each algorithm (e.g., there are over 20 packages to perform random forests). There are pros and cons to each package: some may be more computationally efficient while others may have more hyperparameter tuning options. Future chapters will expose you to many of the packages and algorithms that perform and scale best for most organizations’ problems and data sets. Just realize there are more ways than one to skin a cat.
For example, these three functions will all produce the same linear regression model output.
lm.lm <- lm(Sale_Price ~., data=train_1)
lm.glm <- glm(Sale_Price ~ ., data = train_1, family = gaussian)
lm.caret <- train(Sale_Price ~ ., data = train_1, method = "lm")
One thing you will notice throughout this guide is that we can specify our model formulation in different ways. In the above examples, we use the model formulation approach (Sale_Price ~ ., which says explain Sale_Price based on all features). Alternative approaches, which you will see more often throughout this guide, are the matrix formulation and variable name specification approaches.
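The claim above that different functions produce the same linear regression fit is easy to check for the two base R functions. The sketch below assumes the built-in `mtcars` data rather than the Ames data, and it leaves out caret's `train` so that it runs without extra packages.

```r
# same model formulation, two different fitting functions
fit_lm <- lm(mpg ~ ., data = mtcars)
fit_glm <- glm(mpg ~ ., data = mtcars, family = gaussian)

# the estimated coefficients agree up to numerical tolerance
all.equal(coef(fit_lm), coef(fit_glm), tolerance = 1e-6)
```

The objects returned by `lm` and `glm` differ in class and in the extra summaries they carry, but the fitted coefficients are the same.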
Matrix formulation requires that we separate our response variable from our features. For example, in the regularization section, we’ll use glmnet, which requires our features (x) and response (y) variable to be specified separately.
library(glmnet)
# get feature names
features <- setdiff(names(train_1), "Sale_Price")
# create feature and response set
# glmnet expects a numeric matrix, so one-hot encode the factor columns
# with model.matrix and drop the intercept column
train_x <- model.matrix(~ ., data = train_1[, features])[, -1]
train_y <- train_1$Sale_Price
# example of matrix formulation
glmnet.m1 <- glmnet(x = train_x, y = train_y)
Alternatively, h2o uses variable name specification, where we provide all the data combined in one training_frame but specify the features and response with character strings:
# create variable names and h2o training frame
y <- "Sale_Price"
x <- setdiff(names(train_1),y)
train.h2o <- as.h2o(train_1)
# example of variable name specification
h2o.m1 <- h2o.glm(x = x, y = y, training_frame = train.h2o)
Hyperparameters control the level of model complexity. Some algorithms have many tuning parameters while others have only one or two. Tuning can be a good thing as it allows us to transform our model to better align with patterns within our data. For example, the simple illustration below shows how the more flexible model aligns more closely to the data than the fixed linear model.
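A small simulated example can make the effect of model flexibility concrete. Everything here is an assumption made for illustration: the data are generated from a sine curve with noise, the polynomial degree stands in for a complexity hyperparameter, and the error is measured on the training data only, so it says nothing yet about how well either model generalizes.

```r
# simulated non-linear data (hypothetical, for illustration only)
set.seed(123)
x <- seq(0, 2 * pi, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.3)

# root mean squared error of a fitted model on its own training data
rmse <- function(fit) sqrt(mean(residuals(fit)^2))

# fixed linear model vs. a more flexible degree-5 polynomial
fit_linear <- lm(y ~ x)
fit_flex <- lm(y ~ poly(x, 5))

c(linear = rmse(fit_linear), flexible = rmse(fit_flex))
```

The flexible fit tracks the curve more closely on the data it was trained on; whether that extra flexibility helps on unseen data is exactly what the held-out test set is for.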